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Abstract 

In this paper, we present ii^p multi-task structure learning for Gaussian graphical 
models. We discuss the uniqueness and boundedness of the optimal solution of the max- 
imization problem. A block coordinate descent method leads to a provably convergent 
algorithm that generates a sequence of positive definite solutions. Thus, we reduce the 
original problem into a sequence of strictly convex £p regularized quadratic minimization 
subproblems. We further show that this subproblem leads to the continuous quadratic 
knapsack problem for p — oo and to a separable version of the well-known quadratic 
trust-region problem for p — 2, for which very efficient methods exist. Finally, we show 
promising results in synthetic experiments as well as in two real-world datasets. 



1 Introduction 



Structure learning aims to discover the topology of a probabilistic graphical model such that 
this model represents accurately a given dataset. Accuracy of representation is measured 
by the likehhood that the model explains the observed data. From an algorithmic point of 
view, one challenge faced by structure learning is that the number of possible structures 
is super-exponential in the number of variables. From a statistical perspective, it is very 
important to find good regularization techniques in order to avoid over-fitting and to achieve 
better generalization performance. Such regularization techniques will aim to reduce the 
complexity of the graphical model, which is measured by its number of parameters. 

For Gaussian graphical models, the number of parameters, the number of edges in the 
structure and the number of non-zero elements in the inverse covariance or precision matrix 
are equivalent measures of complexity. Therefore, several techniques focus on enforcing 
sparseness of the precision matrix. An approximation method proposed in |Meinshausen and 
Biihlmann 2006 relied on a sequence of sparse regressions. Maximum likelihood estimation 



with an 



?i-norm penalty for encouraging sparseness is proposed in Banerjee et al. 

120071. 



Friedman et al. 2007 , Yuan and Lin 



2006 



Structure learning techniques are very useful for analyzing datasets for which probabilis- 
tic dependencies are not known apriori. For instance, these techniques allow for modeling 
interactions between brain regions, based on measured activation levels through imaging. 
Suppose that we want to learn the structure of brain region interactions for one person. 
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We can expect that the interaction patterns in the brains of two persons are not exactly 
the same. On the other hand, when learning the structure for one person, we would like to 
use evidence from other persons as side information in our learning process. This becomes 
more important in settings with limited amount of data and high variability, such as in 
functional magnetic resonance image (fMRI) studies. Multi-task learning allows for a more 
efficient use of training data which is available for multiple related tasks. 

In this paper, we consider the computational aspect of ii^p multi-task structure learning, 
which generalizes the learning of sparse Gaussian graphical models to the multi-task setting 
by replacing the ^i-norm regularization with an £i^p-norm, also known as the simultaneous 
prior |Turlach et al. , 2005 , Tropp , 2006 for p = oo or the group-sparse prior [Yuan and Lin 



2006, Meier et al. 2008 for p = 2 



Our contribution in this paper is three- fold. First, we present a block coordinate descent 
method which is provably convergent and yields sparse and positive definite estimates. 
Second, we show the connection between our ii^p multi-task structure learning problem and 
the continuous quadratic knapsack problem for p = oo, which allows us to use existing 
efficient methods [Helgason et al. 1980 Brucker, 1984, Kiwiel 2007 . We also show the 
connection between our multi-task structure learning problem and the quadratic trust- 
region problem for p = 2, which can be efficiently solved by one-dimensional optimization. 
Third, we discuss penalization of the diagonals of the precision matrices and experimentally 
show that penalizing the diagonals does not lead to better generalization performance, when 
compared to not penalizing the diagonals. 



Compared to our related shorter conference paper [Honorio and Samaras 2010 



we 



present a more general framework which assumes p > 1, while Honorio and Samaras 
2010 assumes p = oo. We present a new algorithm for p = 2 and experimentally show 
that our method recovers the ground truth edges and the probability distribution always 
better than the 



2 method of Varoquaux et al. 



2010 



for every regularization level. We 
discuss penalization of the diagonals of the precision matrices which leads to additional 
optimization problems, namely the continuous logarithmic knapsack problem for p = oo 
and the separable logarithmic trust-region problem for p = 2. We show that our method 
outperforms others in recovering the topology of the ground truth model through synthetic 
experiments. In addition to the small fMRI dataset used in [Honorio and Samaras 2010 



we include validation in a considerably larger fMRI dataset. We experimentally show that 
the cross- validated log-likelihood of our method is higher than competing methods in both 
real- world datasets. 



Section [2] introduces Gaussian graphical models as well as techniques for learning such 
structures from data. Section [3] sets up the £i^p multi-task structure learning problem 
and discusses some of its properties. Section |4] describes our block coordinate descent 
method. Section [5] shows the connection to the continuous quadratic knapsack problem. 
Section [6] shows the connection to the quadratic trust-region problem. Section [T] presents 
our algorithm in detail. Section [8] discusses penalization of the diagonals of the precision 
matrices. Experimental results are shown and explained in Section |9| Main contributions 
and results are summarized in Section [Tol 
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Notation 


Description 


c 1 


€i-norm of c G R" , i.e. J2n 1'^" 


II C II oo 


£oo-norm of c G R'^, i.e. max„ |c„ 


Ilc|l2 


Euclidean norm of c G R^, i.e. \/X]„ 


diag(c) e R^><^ 


matrix with elements of c G R'^ on its diagonal 


A ^ 


A G R^*^^ is symmetric and positive semidefinite 


A ^ 


A G R^*^^ is symmetric and positive definite 


l|A||i 


^i-norm of A G R"^'^, i.e. Wmnl 


l|A||oo 


£oo-norm of A G R^''^^^, i.e. maxm„ \ am.n\ 


l|A||2 


spectral norm of A G R^^^, i.e. the maximum eigenvalue of A ;^ 


liAlls 


Frobenius norm of A G R"'''^, i.e. \/Em„ "mn 


(A,B) 


scalar product of A, B G R*^^^, i.e. 5^ 



Table 1: Notation used in this paper. 



2 Background 

In this paper, we use the notation in Table [l} 

A Gaussian graphical model is a graph in which all random variables are continuous 
and jointly Gaussian. This model corresponds to the multivariate normal distribution for 
variables with covariance matrix S E M^^^. Conditional independence in a Gaussian 
graphical model is simply reflected in the zero entries of the precision matrix Q = 
[Lauritzen 1996 . Let fl = {u!nin2}j two variables ni and n2 are conditionally independent 
if and only if ujnin2 = 0. 

The concept of robust estimation by performing covariance selection was first introduced 
in Dempster [1972 where the number of parameters to be estimated is reduced by setting 
some elements of the precision matrix fl to zero. Since finding the most sparse precision 



matrix which fits a dataset is a NP-hard problem [Banerjee et al. 2006 , in order to overcome 
it, several ^i-regularization methods have been proposed for learning Gaussian graphical 
models from data. 

Given a dense sample covariance matrix S ^ 0, the problem of finding a sparse precision 
matrix Q by regularized maximum likelihood estimation is given by: 



max I 



ft) 



P 



(1) 



for regularization parameter p > 0. The term encourages sparseness of the precision 

matrix or conditional independence among variables, while the term ^g(ri) is the Gaussian 
log-likelihood, and it is defined as: 



logdetn - 



(2) 



Several optimization techniques have been proposed f or eg. ([!]): a se quence of box- 
constrained quadratic programs in the covariance selection [Banerjee et al. 2006], solution 
of the dual problem by sparse regression in the graphical lasso [Friedman et al.^ ^2007j or an 
approximation via standard determinant maximization with linear inequality constraints in 
. Instead of solving eq.Q, the Meinshausen-Buhlmann approximation 



Yuan and Lin 



2007 



Meinshausen and Biihlmann, 2006| obtains the conditional dependencies by performing a 



sparse linear regression for each variable, by using lasso regression (Tibshirani 1996 . 
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Besides sparseness, several regularizers have been proposed for Gaussian graphical mod- 
block 



els for single-task learning, for enforcing diagonal structure Levina et al. , 2008 



structure for known block-variable assign ments [Duchi et al. 



and unknown block-variable assignments Marlin and K. Murphy 



2008a 



Schmidt et al. 



2009 



Marlin et al. 



spatial coherence Honorio et al. , 20091 , sparse changes in controlled experiments Zhang 



and Wang 2010 , power law regularization in scale free networks iLiu and Ihler, 2011 



2009 



2009 



or 



variable selection Honorio et al. , 2012 



Multi-task learning has been applied to very diverse problems, such as linear regression 

compressive sensing [Qi et alT 2008 



|Liu et al. 




2009a|b|, classification [Jebara 


reinforcement learning Wilson et al. 2007 


[Niculescu- 


Mizil and Caruana, 2007 . 



2004 



2007 and structure learning of Bayesian networks 



Structure learning through £i-regularization has been also proposed for different types 
of graphical models: Markov random fields (MRFs) by a clique selection heuristic and 

Bayesian networks on binary variables by logistic 



approximate inference [Lee et al 
regression 



2006 



Schmidt et al. , 2007 ; Conditional random fields by pseudo- likelihood and block 



regularization in order to penalize all parameters of an edge simultaneously ^Schmidt et al. 



2008 ; and Ising models, i.e. MRFs on binary variables with pairwise interactions, by logistic 



regression 
(2006 . 



Wainwright et al. , 2006 which is similar in spirit to Meinshausen and Biihlmann 



3 Preliminaries 

In this section, we set up the problem and discuss some of its properties. 
3.1 Problem Setup 

We propose a prior that is motivated from the multi-task learning literature. Given K 
arbitrary tasks, our goals are to learn one structure for each task that best explains the 
observed data, and to promote a common sparseness pattern of edges for all tasks. 

For a given task k, we learn a precision matrix fi^'^) G M^^^ for N variables. Our 



(1) 



nin2 ) • • • ) ^nin2 



(K) 



multi-task regularizer penalizes corresponding edges across tasks (i.e. a;. 
Let S^^^ ^ be the dense sample covariance matrix for task k, and T^''^ > be the number 
of samples in task k. The ii^p multi-task structure learning problem is defined as: 



max 

(Vfc) n('=)^o 



(3) 



for regularization parameter p > and ^i^p-norm for p > 1. The term ^g(fc) (fi^'^)) is the 
Gaussian log-likelihood defined in eq.Q, while the term ||ri||i^p is our multi-task regularizer, 
and it is defined as: 

~ (a;« ,...,a;W (4) 



We assume that p > 1, since for p = 1, the multi-task problem in eq.Q reduces to 
K single-task problems as in eq. and for p < 1, eq.Q is not convex. The number of 
samples T^^^ is a term that is usually dropped for covariance selection and graphical lasso 
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as in eq.Q. For the multi-task structure learning problem, it is important to keep this term 
when adding the log-likelihood of several tasks into a single objective function. 

The ii^2 multi-task structure learning problem was originally proposed in [Varoquaux 
et al. 2010 , where the authors minimize the original non-smooth objective function by using 



a sequence of smooth quadratic upper bounds. Varoquaux et al. 2010 do not provide any 
guarantee of positive definiteness, eigenvalue bounds or convergence. The £1^00 multi-task 



problem was originally proposed in Honorio and Samaras 12010 . In this paper, we analyze 



the computational aspects of the more general £i^p multi-task problem for p > 1. While the 
£1^00 multi-task problem of Honorio and Samaras [2010 leads to the continuous quadratic 



knapsack problem, we show that the £1^2 multi-task problem leads to the quadratic trust- 
region problem. Another multi-task penalty has been proposed in Guo et al. [2010 , however 
this penalty is non-convex. 



3.2 Bounds 

In what follows, we discuss the uniqueness and boundedness of the optimal solution of the 
multi-task structure learning problem. 

Lemma 1. For p > and p > 1, the £i^p multi-task structure learning problem in eg.Q 
is a maximization problem with a concave (but not strictly concave) objective function and 
convex constraints. 

Proof. The Gaussian log-likelihood defined in eq.Q is concave, since logdet is concave on 
the space of symmetric positive definite matrices and (•, •) is a linear operator. The multi- 
task regularizer defined in eq.Q is a non-smooth convex function. Finally, fi'^'^^ ;^ is a 
convex constraint. □ 

Theorem 2. For p > and p > 1, the optimal solution to the ii^p multi-task structure 
learning problem in eq.^ is unique and bounded as follows: 



1 



|5]W||2 + 



Np 



(5) 



Proof. Let the £p/-norm be the dual of the £p-norm, i.e. - + 



1. By using the identity 



for dual norms /o||c||p = max||a||^,<p a'^c in eq.([3]), we get: 

n j;T('=)LgdetnW-(E(^') + ^,fiW) 

T.in2\\pl<P \ ^ 



max mm 

(Vfc) n('=)^0 (Vninz) ||a„ 



(6) 



V"'ni?i2 1 ■ 



T 



anrn2) cind A*-^-* G M^^^. By virtue of Sion's minimax theorem. 



where a„^„2 

we can swap the order of max and min. Furthermore, note that the optimal solution of 



the inner equation is independent for each k and is given by ^l^^^ = (Xl^^^ + ^oe) 
replacing this solution in eq. ^ , we get the dual problem of eq. ^ : 



mm 

(Vnin2) llanj^nj llp'<P 



J^tC^) logdet ("sC^) 

k \ 



+ 



aW 



NK 



■ By 



(7) 
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In order to find a lower bound for the minimum eigenvalue of fi^^^*, note that ||ri(^)* ||2 = 
W^'^'^ + ^h < \\^^'% + \\^h = l|S('=)||2 + ^||AW|b < ||S(^)||2 + ^||AW||,. Since 
||a„j„2||p/ < p, it follows that lan^^njl ^ P therefore ||A(^)||t^ < Np. 

In order to find an upper bound for the maximum eigenvalue of fl^'^-' , note that, at 
optimum, the primal-dual gap is zero: 

- NK + ^r('=)(s(^\ n^*) + p\\n*\\i,p = o (8) 

k 

The upper bound is found as follows: ||fi('^)*||2 < ||fi'^'^)* < Uri^*^)*!]! < ||f2*||i,p = 
7Vi^-E.T(^)(s('^).nW') ^.^^^ -(fc) ^ Q ^(,)* ^ ^ ^^^^^^^ ^^^^ > 

0. □ 



4 Block Coordinate Descent Method 



In this section, we develop a block coordinate descent method for our £i_p multi-task struc- 
ture learning problem, and discuss some of its properties. 

Since the objective function in eq.([3]) contains a non-smooth regularizer, methods such as 
gradient descent cannot be applied. On the other hand, subgradient descent methods very 
rarely converge to non-smooth points Duchi and Singer, 2009 . In our problem, these non- 



smooth points correspond to zeros in the precision matrix, are often the true minima of the 
objective function, and are very desirable in the solution because they convey information 
of conditional independence among variables. 



We apply block coordinate descent method on the primal problem iHonorio et al. , 2009 



Honorio and Samaras 2010 Honorio et al. 



2012 



unlike covariance selection Banerjee 



et al. 2006| and graphical lasso [Friedman et al. 2007| which optimize the dual. We choose 
to optimize the primal because the dual formulation in eq.Q leads to a sum of K terms 
(logdet functions) which cannot be simplified to a quadratic problem unless K = 1. 

For clarity of exposition, we will first assume that the diagonals of . . . , fl^^^ are 
not penalized by our multi-task regularizer defined in eq. In Section [8| we will discuss 
penalization of the diagonals, for which an additional continuous logarithmic knapsack prob- 
lem for p = oo or separable logarithmic trust-region problem for p = 2 needs to be solved. 
We point out that all the following theorems and lemmas still hold in that case. 

Lemma 3. The solution sequence generated by the block coordinate descent method is 
bounded and every cluster point is a solution of the ii^p multi-task structure learning problem 
in eq.^. 

is separable into a sum of 0{N'^) 



Proof. The non-smooth regularizer ||r2| 



(1) 



n2 ) ■ 



{K) 



1,P 

\\v 



individual 



functions of the form ||(a;l\n2> • • • ^^n^n2)\\n- These functions are defined over blocks of K 



The objective function in eq.Q is continuous on a compact 



variables, i.e. ujn^ 

level set. By virtue of Theorem 4.1 in Tseng [2001 , we prove our claim. 



□ 



Theorem 4. The block coordinate descent method for the ii^p multi-task structure learning 
problem in eg.Q generates a sequence of positive definite solutions. 
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Proof. Maximization can be performed with respect to one row and column of all precision 
matrices $7^^^ at a time. Without loss of generality, we use the last row and column in our 
derivation, since permutation of rows and columns is always possible. Let: 



t>N-lxN-l 



y 

.(fc) „(fc) g 



yik) 



u 



(9) 



where WC^^S^^) G K'-'^^'-'^ y^"'^ 

In terms of the variables y^^\ z^''^ and the constant matrix W^'^^ the multi-task structure 
learning problem in eq.([3]) can be reformulated as: 



max 

(Vfe) ilW>-0 




,{k)Jk) 



(10) 



If ri^'^) is a symmetric matrix, according to the Haynsworth inertia formula, fi^^^ ^0 if 
and only if its Schur complement z^^'^ — y^^^ W^^^ y^'^^ > and W^'^^ >~ 0. By maximizing 
eq.(lO) with respect to z^^\ we get: 



y{k)'^^^{k)-^y{k) 



1 



(11) 



and since v'^^'^ > 0, this implies that the Schur complement in eq.(ll) is positive. 

Finally, in an iterative optimization algorithm, it suffices to initialize fl^^^ to a matrix 
that is known to be positive definite, e.g. a diagonal matrix with positive elements. □ 



Remark 5. Note that eq.{ll) defines the "diagonal update step" of the block coordinate de- 
scent method. For each k we set z^^^ to its optimal value, i.e. z*^'^-'* = -?W+y^^^ W*^'^) ^y'^'^^- 



Theorem 6. The "off-diagonal update step" of the block coordinate descent method for the 
ii^p multi-task structure learning problem in eg.Q is equivalent to solving a sequence of 
strictly convex ii^p regularized quadratic subproblems: 



mm 

(Vfc) yC") 



(EkT^'^ 



.(1) 



^ {k)^^{k)-'^y{k) ^^(k)'^y{k)'' 



,-..,y'n')\\ 



(12) 



Proof. By replacing the optimal z^''^ given by eq.(ll) into the objective function in eq.(lO), 
we get eq.(12). Since W^'^^ ^ =^ w^ik) y q hence eq.(12) is strictly convex. □ 

As we will show in Section Isl the Schur complement is still positive when we penalize the 
diagonals, i.e. z^^'^ — y^^^ ^^(k) y(k) — ^ ^ q Note that in such case, ^ 7^ in contrast 
to eq.( 11 ) but we can still perform the replacement in eq.( 10 ), and therefore Theorem [6] still 
holds when penalizing the diagonals. 

Lemma 7. Let the (.pi -norm be the dual of the Ip-norm, i.e. ^ + ^ = 1- If the £oo,p' norm 
max„ \\[T^^^Un \ ■ ■ ■ , T^^^u^^^) ||p/ < p, the ii^p regularized quadratic problem in eg.(12) has 
the minimizer (V/c) y'-'^-* = 0. 
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Proof. The problem in eq.(12) has the minimizer (V/c) y^'^-' = if and only if belongs to 
the subdifferential set of the non-smooth objective function at (V/c) y^'^) = 0, i.e. (3A G 
^7V-ixi^) (r(i)u(i), . . . ,rWuW) + A = A maxn ||(a„i, . . . ,a„K)||p' < P- This condition 
is true for max^ \\{T^^'>u'h\. ■ . , TWui^^)||p/ < p. □ 

Remark 8. By using Lemma^ we can reduce the size of the original problem by removing 
variables in which this condition holds, since it only depends on the dense sample covariance 
matrix. 

Theorem 9. The coordinate descent method for the ii^p regularized quadratic problem in 
eq.(12) is equivalent to solving a sequence of strictly convex ip regularized separable quadratic 
subproblems: 

(13) 



min I -X diag(q)x — c x + /9||x|L 

Proof. Without loss of generality, we use the last row and column in our derivation, since 
permutation of rows and columns is always possible. Let: 



^11 ^12 



h 



12 



where H^^'* G 



t,N-2xN~2 



hi2 ,yi 



22 

(fc) 
1 



, y 



(fc) 
yl 



u 



1 



u 



(14) 



i>Af-2 



In terms of the variable x and the constants qk 



, the ii^p regularized quadratic problem in eq.(12) can be reformulated as in eq.(13). 
Moreover, since (V/c) T^^) > OAt;('=) > 0a4? > ^ q > 0, and therefore eq.^ is strictly 
convex. □ 



5 Continuous Quadratic Knapsack Problem 



In this section, we show the connection between the multi-task structure learning problem 
and the continuous quadratic knapsack problem, for which very efficient methods exist. 
The continuous quadratic knapsack problem has been solved in several areas 



son et al. 
[Brucker 



1984 



Helga- 



1980 provides an ©(i^ log i^T) algorithm which initially sort the breakpoints, 
and later 



Kiwiel, 20071 provide deterministic linear-time algorithms by 



using medians of breakpoint subsets. In the context of machine learning. 



2008b] provides a randomized linear-time algorithm, while Liu et al. , 2009a 



0{K log K) algorithm. We point out that Duchi et al. , 2008b Liu et al. 



Duchi et al, 



provides an 



2009a 



that the weights of the quadratic term are all equal, i.e. (V/c) qk 
assume arbitrary positive weights, i.e. (Vfc) qk > 0. 

Theorem 10. For q > 0, p > 0, p = oo, the 



1. 



assume 
In this paper, we 



ioo regularized separable quadratic problem in 
eg. (13) is equivalent to the separable quadratic problem with one £i constraint: 



min I -(r — c)"'"diag(q) ^(r — c) j 



(15) 



Furthermore, their optimal solutions are related by x* = diag(q) ""^(c — r*) 
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Proof. By Lagrangian duality, the problem in eq.(15) is the dual of the problem in eq.(13). 
Furthermore, strong duality holds in this case. □ 



Remark 11. In eg. (15), we can assume that {\/k) Ck 7^ 0. // (3k) Ck = 0, the partial 
optimal solution is r|, = 0, and since this assignment does not affect the constraint, we can 
safely remove r^ from the optimization problem. 

Remark 12. In what follows, we assume that ||c||i > p. If ||c||i < p, the unconstrained 
optimal solution of eq.{lf>) is also its optimal solution, since r* = c is inside the feasible 
region given that ||r*||i < p. 

Lemma 13. For q > 0, (Vfc) 7^ 0, ||c||i > p, the optimal solution r* of the separable 
quadratic problem with one li constraint in eg.(15) belongs to the same orthant as the 
unconstrained optimal solution c, i.e. (V/c) r^Cfc > 0. 

Proof. We prove this by contradiction. Assume (B/ci) r^^c^j < 0. Let r be a vector such 
that rfcj = and (VA;2 7^ /^i) rk^ = rj,^- The solution r is feasible, since ||r*||i < p and 
II r 111 = II r* 111 — |r|^,J < p. The difference in the objective function between r* and r is 

i(r* -c)Tdiag(q)-\r*-c)-i(r-crdiag(q)-i(r-c) = ^(rl^-2ck,rl) > > 0. 
Thus, the objective function for r is smaller than for r* (the assumed optimal solution), 
which is a contradiction. □ 

Theorem 14. For q > 0, {\/k) Ck 7^ 0, ||c||i > p, the separable quadratic problem with one 
ii constraint in eg.(15) is equivalent to the continuous quadratic knapsack problem: 

min V 7r'-(S'fc - |cfc|)^ (16) 
g>o ^ 2gfc 

Furthermore, their optimal solutions are related by (VA;) = sgn{ck)gl.. 



Proof. By invoking Lemma 13 we can replace (Vfc) rj. = sgn{ck)gk, 5fc ^ in eq.(15). 
Finally, we change the inequality constraint I'^g < p to an equality constraint since ||c||i > p 
and therefore, the optimal solution must be on the boundary of the constraint set. □ 

Lemma 15. For q > 0, {\/k) Ck 7^ 0, ||c||i > p, the continuous quadratic knapsack problem 
in eg. (16) has the solution: 

gk{y) = max(0, |cfc| - uq^) (17) 
for some v, and furthermore, the optimal solution fulfills the condition: 

g* = g{i^) ^ lTg(j.) = p (18) 



Proof. The Lagrangian of eq.(16) is: 



in I (Qk 



miji I > . w:ri9k - \ck\y + ^^(i^g - p) I (19) 

Both results can be obtained by invoking the Karush-Kuhn- Tucker optimality conditions 
oneq.((T9l). □ 
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Remark 16. Note that gk[u) in eq.(\7) is a decreasing piecewise linear function with break- 

\Ck\ 



point V 



Qk 



> 0. By Lemma 



15, finding the optimal g* is equivalent to finding v in a 



piecewise linear function \ ^^{i>) that produces p. 
Lemma 17. For q > 0, (V/c) c^ 7^ 0, ||c||i > p, the continuous quadratic knapsack problem 



in eq.{lQ) has the optimal solution g"^ = max(0, \ck\ — i^*qk) for: 



> I/* 



El 



Ek* 
k=l 



_/3 ^ \'-TTk*+l I 



(20) 



where the breakpoints are sorted in decreasing order by a permutation vr of the indices 



1,2,. ..,K, I.e. ^>^> 



> 



= 0. 



Proof. Given k* , v* can be found straightforwardly by using the equation of the Une. In 



order to find k* , we search for the range in which I'^g 



< p < l^g 



|c7r^*_j_ll 



□ 



Theorem 18. For q > 0, p > 0, p = 00, the ioo regularized separable quadratic problem in 



eg. (13) has the optimal solution: 



|c||i < p =^ X* = 
|c||i > p Ak > k* = 

|c||i > pAk <k* = 



X' 



^^k 
i^k 



X 



TTfc 



sgn(c^J 



(21) 



Proof. For ||c||i < p, from Remark 12 we know that r* = c. By Theorem 10, the optimal 

solution of eq.(13) is x* = diag(q)~ (c — r*) = 0, and we prove the first claim. 

1 

i^k 

^(c^,-sgn(c^j5;j. ByLemma^'- - N^.„,..n 



For ||c||i > p, by Theorem 
Theorem 



14 



10 



the optimal solution of eq.(13) x*^ = — (cttj. — t"^,.)- By 



17 



T^k 



-sgn c 



maxfO, ^^^^ 
^ ' i^k 



lfk> k* 
Iik<k* 



> u* 



— , and we prove the second claim. 
i^k ' ^ 

sgn(c7r^)i/*, and we prove the third claim. 



□ 



6 Separable Quadratic Trust-Region Problem 

In this section, we show the connection between the £1.2 multi-task structure learning prob- 
lem and the separable quadratic trust-region problem, which can be efficiently solved by 
one-dimensional optimization. 

The trust-region problem has been extensively studied by the mathematical optimization 
community fForsythe and Golub 1965 More and Sorensen, 1983, Boyd and Vandenberghe 
[2006] . Trust -region methods arise in the optimization of general convex functions. In 
that context, the strategy behind trust-region methods is to perform a local second-order 
approximation to the original objective function. The quadratic model for local optimization 
is "trusted" to be correct inside a circular region (i.e. the trust region). Separability is 



usually not assumed, i.e. a symmetric matrix Q is used instead of diag(q) in eq.(13), and 
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therefore the general algorithms are more involved than ours. In the context of machine 
learning, [Duchi and Singer 2009 provides a closed form solution for the separable version 
of the problem when the weights of the quadratic term are all equal, i.e. (V/c) qk = 1- In 
this paper, we assume arbitrary positive weights, i.e. (V/c) > 0. A closed form solution is 
not possible in this general case, but the efficient one-dimensional Newton-Raphson method 
can be applied. 

Theorem 19. For q > 0, p > 0, p = 2, the (.2 regularized separable quadratic problem in 
eg. (13) is equivalent to the separable quadratic trust-region problem: 



mm 

Ikll2<p 



(r-c) diag(q) (r - c; 



(22) 



Furthermore, their optimal solutions are related by x* = diag(q) ^(c — r*) 



Proof. By Lagrangian duality, the problem in eq.(22) is the dual of the problem in eq.(13). 
Furthermore, strong duality holds in this case. □ 



Remark 20. In eg. (22), we can assume that iyk) c^ 7^ 0. // (3A;) = 0, the partial 
optimal solution is r^ = 0, and since this assignment does not affect the constraint, we can 
safely remove r^ from the optimization problem. 

Remark 21. In what follows, we assume that ||c||2 > p. If ||c||2 < p, the unconstrained 
optimal solution of eg. (22) is also its optimal solution, since r* = c is inside the feasible 
region given that \\r*\\2 < p. 

Lemma 22. For q > 0, (VA;) 7^ 0, ||c||2 > p, the separable quadratic trust-region problem 
in eg. (22) is equivalent to the problem: 



mm 

A>0 



^^^+p'X 



Furthermore, their optimal solutions are related by r* = diag(l + A*q)^"'^c. 



(23) 



Proof. By Lagrangian duality, the problem in eq.(23) is the dual of the problem in eq.(22). 
Furthermore, strong duality holds in this case. □ 

Corollary 23. For the special case q = 1 of Duchi and Singer \2009 the trust-region dual 



problem in eg. (23) has the closed form solution A* = max ( 



||C||2 
P 



Proof. For q = 1, the problem in eq.(23) becomes min;j>o 



Ml , .2- 



l+A 



+ p'\ 



By minimizing 

□ 



with respect to A and by noting that A > 0, we prove our claim. 

Theorem 24. For q > 0, p > 0, p = 2, the I2 regularized separable quadratic problem in 
eg. (13) has the optimal solution: 



|c||2 < p ^ X* = 

|c||2 > /O =^ X* = A*diag(l + A*q)"^c 



(24) 
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Algorithm 1 Block Coordinate Descent 



Input: p > 0, for each fc, S<'=' t 0, T*''' > 

Initialize for each k, 0'*=) = diag(S('=')"^ 

for each iteration 1, . . . ,L and each variable 1, . . . , A'^ do 

Split for each k, fi**' into W('=',y('=',z(*=' and S'*^' into S'''\u^''\v^''^ as described in eq.Q 
Update for each k, W'''' by using the Sherman- Woodbury-Morrison formula (Note that when iter- 
ating from one variable to the next one, only one row and column change on matrix W'*^') 
for each variable 1, . . . , A*' — 1 do 

Split for each k, W''=)~\ y**^' , ue^) as in eq.([l4| 

For p — oo, solve the loa regularized separable quadratic problem by eq.(21l, either by sorting the 
breakpoints or using medians of breakpoint subsets. For p = 2, solve the £2 regularized separable 
quadratic problem by eq.( 24 1 by using the Newton-Rap hson method for solving the trust-region dual 
problem in eq.(23l 
end for 

Update for each k, z'*' ^ ^ + y('=)'^w('=' ~ V**' 
end for 

Output: for each k, f2<'=' ^ 



Proof. For ||c||2 < p, 



solution of eq.(13) is x* = diag(q) 

> p, by Theorem 



from Remark |21| we know that r 

T 



c. By Theorem 19 the optimal 



For ||c||2 
By Lemma 



22 



c — r*) = 0, and we prove the first claim. 



Qk 



the optimal solution of eq.(13) is (V/c) 
Cfe) = i^rx*q^Ck, and we prove the second claim 



r 



kJ- 

□ 



7 Algorithm 

Algorithm [1] shows the block coordinate descent method in detail. A careful implementation 
of the algorithm allows obtaining a time complexity of 0{LN^K) for L iterations, vari- 
ables and K tasks. In our experiments, the algorithm converges quickly in usually L = 10 
iterations. The polynomial dependence 0{N^) on the number of variables is expected since 
we cannot produce an algorithm faster than computing the inverse of the sample covari- 
ance in the case of an infinite sample. For p = 00, the linear-time dependence 0{K) on 
the number of tasks can be accomplished by using a deterministic linear-time method for 
solving the continuous quadratic knapsack problem, based on medians of breakpoint sub- 
sets Kiwiel, 2007] . A very easy-to-implement 0{K log K) algorithm is obtained by initially 
sorting the breakpoints and searching the range for which Lemma 17 holds. For p = 2, 
the linear-time dependence 0{K) on the number of tasks can be accomplished by using 
the one-dimensional Newton-Raphson method for solving the trust-region dual problem 
in eq.(23). In our implementation, we initialize A = and perform 10 iterations of the 



Newton-Raphson method. 



8 Penalizing the Diagonals 

In this section, we discuss penalization of the diagonals of the precision matrices. It is 
unclear whether diagonal penalization leads to better models with respect to structure 
as well as generalization performance. For the single-task problem, covariance selection 
[Banerjee et al. 2006j and graphical lasso Friedman et al. , 2007| penalize the weights of 
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the diagonal elements. In contrast, the analysis of consistency in structure recovery of 



Ravikumar et al. [2008] assumed that diagonals are not penalized. 



Note that, when the diagonals are not penalized, the "diagonal update step" (Remark 
5) reduces to setting for each k, z^^^* = + y^'^^ W*^'^) ^y^*^^- Penalization of the diag- 
onals of the precision matrices is more involved, since it requires the solution of additional 
optimization problems, namely the continuous logarithmic knapsack problem for p = oo 
and the separable logarithmic trust-region problem for p = 2. First, we discuss the general 
problem for arbitrary p > 1. 

Lemma 25. When penalizing the diagonals of the precision matrices, the "diagonal update 
step" of the block coordinate descent method for the ii^p multi-task structure learning prob- 
lem in eq.^ is equivalent to solving a sequence of strictly convex £p regularized separable 
logarithmic subproblems: 



max 
(Vfc) zW>bk 



Qk log(z('=) - bk) - c^z - p||z||p (25) 



where z = {z^^\ . . . , z^^'^f^ and q, c, b > 0. Moreover, the block coordinate descent method 
generates a sequence of positive definite solutions. 

Proof. When we choose to penalize the diagonals of the precision matrices . . . , 
eq.(lO) contains an additional ip penalty, i.e. /o||z||p. In terms of the variables y^'^-', z^^'^ and 



the constant matrix W*^'') introduced in eq.([9|, the multi-task structure learning problem 
in eq.Q can be reformulated as: 

/V, tC') ('log(zW _ y(fc)'^wW"V^^^) - 2uW'^yW - 

(v/c) nw^o V-2pE„ \ • • • , - pWAv ) 

Let qk = T(^) > 0, Cfc = T^'^)^^^') > and bk = y(^O^w('=)" V^''^ 

> since W(^) ^ and 



y*^'^) is an arbitrary vector (including the case y^'^-' = 0). We obtain eq.(25) by noting that 
we are maximizing with respect to z and by enforcing (V/c) z'-'^^ > bk since log(2;('^) — bk) is 
undefined for z^^^ < bk. 

If ri^'^) is a symmetric matrix, according to the Haynsworth inertia formula, il^^^ ^0 if 
and only if its Schur complement z'^^'* — y^^^ W^'^) y^^) > and W^'^') >~ 0. Note that the 
Schur complement z^^^ — y^^^ W^^) y^*^) = z^^^ — bk and therefore it is strictly positive 
for every feasible solution given the constraints (V/c) z^^'^ > bk. 

Finally, in an iterative optimization algorithm, it suffices to initialize fi^^^ to a matrix 
that is known to be positive definite, e.g. a diagonal matrix with positive elements. □ 

Lemma 26. For q, c, b > 0, /? > 0, p > 1, the Ip regularized separable logarithmic problem 
in eg. (25) is equivalent to the separable logarithmic problem with one ip' constraint: 



min ^- ^ qk log(rfc + Ck) - b'^r j 



(27) 



m\p'=p 

Furthermore, their optimal solutions are related by z^^^* = bk + 



Qk 



Ck+rt ■ 
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Proof. By Lagrangian duality, the problem in eq.(27) is the dual of the problem in eq.(25). 
Furthermore, strong duality holds in this case. 

The constraint r > comes from the fact that z > since it is the diagonal of positive 
definite matrices. Note that for a general z G M.^ , we have /o||z||p = max||i.|| ,<p r'^z. In order 



to maximize this expression, r will take values on the non-negative orthant since z > 0. 

We changed the inequality constraint ||r||p/ < p to an equality constraint since the 
objective is separable and decreasing with respect to each and therefore, the optimal 
solution must be on the boundary of the constraint set. □ 

In what follows, we focus on the case p = oo and show that this problem can be solved 
by a combination of sorting and the Newton-Raphson method. 

Lemma 27. For q, c, b > 0, p > 0, p = oo, the separable logarithmic problem with one ipi 



in eq.(27) is the continuous logarithmic knapsack problem: 



min I -^gfclog(rfc + Cfc) - b'^r I 



(28) 



which has the solution: 



+00, 



(29) 



0, 



Cfe 



for some v, and furthermore, the optimal solution fulfills the condition: 

r* = t{v) 44> 1^y{u) = p 



(30) 



Proof. The Lagrangian of eq.(28) is: 



mm 

r>0 



(y- ^ qk log(rfe + Ck) - h^r + u{l^r - p)j 



(31) 



Both results can be obtained by invoking the Karush-Kuhn- Tucker optimality conditions 
on eq.(31 ). □ 



Remark 28. Note that for v < hk, we have rk{y) = +oo in eg. (29), therefore \^y{v) is 
finite if and only if v > maxkbk = ||&||oo- Additionally, for u > ||6||oo we have that rk{v) 



in eg'. (29) is a decreasing piecewise inverse function with breakpoint v 



3^+bk>0. By 



Lemma finding the optimal r* is equivalent to finding u in a piecewise inverse function 
l''"r(z^) that produces p. 



Similarly as in Lemma 17, we sort the breakpoints in decreasing order, i.e. we find a 



^ -1-5 > 



permutation vr of the indices 1,2, ... ,K such that 1p- + b-,^ > ^ + &7ro > • • • > 
'^"^'^"^ + bT^ = 0. Then, we search for the optimal breakpoint k* or equivalently we search 



for the range in which l"'"r 



+ h 



""fc*+i 



+ ^■^k*+i I ■ After finding k*, u* 
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can be found by the Newton- Raphson method in order to fulfih the condition l^r{i'*) = p 



in Lemma 27 In our implementation, we initiahze u at one of the extremes of the optimal 
range, i.e. v = max(||6||oo + e, + ^tt^.) for some small e > 0. We then perform 10 
iterations of the Newton-Raphson method. 

Next, we focus on the case p = 2 and show that this problem can be solved by the 
Newton-Raphson method. 

Lemma 29. For q, c, b > 0, p > 0, p = 2, the separable logarithmic problem with one £„/ 



in eq.(27) is the separable logarithmic trust-region problem: 



min i-^qk log(ryfc + Cfc) - b'^r | (32) 



\\r\\2<P 

which can be solved by one- dimensional optimization of: 

max ^ Qk log(rfe (A) + Ck) - h^r{X) + ^ f r (A)^r(A) -p']\ (33) 



where rfc(A) = b,-Xc,+^ib.+Xc,r +AM. 



Proof. By Lagrangian duality, the problem in eq.(33) is the dual of the problem in eq.(32). 



Furthermore, strong duality holds in this case. □ 

In our implementation, we initialize A = Ylk '^''p(cfc+p")'^^ ^^'^ perform 10 iterations of 
the Newton-Raphson method. Our initialization rule follows from using the average of the 
k independently optimal values of A. That is, we consider only one task k at a time from 
eq.([33l) which leads to maxA>o {-Qk ^og{rk{X) + Cfc) - bkrk{X) + f (?'|(A) - p"^)). Then, we 



compute the optimal value of A under this setting, which is '^'°p('efc+p|^^ ■ FmS'lly) we average 
these optimal values for all k which leads to our initialization rule for A. 



9 Experimental Results 

We begin with a synthetic example to test the ability of the method to recover the ground 
truth structure from data. The model contains = 50 variables and K = 5 tasks. For 
each of 50 repetitions, we generate a topology (undirected graph) G {0, with 
a required edge density (either 0.1,0.3,0.5). For each task k, we first generate a Gaussian 

(k) 

graphical model Ctg with topology Tg where each edge weight is generated uniformly at 
random from [—1; +1]. We ensure positive definiteness of Clg'^ by verifying that its minimum 
eigenvalue is at least 0.1. We then generate a dataset of T^^^ = 50 samples. 

In order to measure the closeness of the recovered models to the ground truth, we mea- 
sured the Kullback-Leibler divergence, sensitivity (one minus the fraction of falsely excluded 
edges) and specificity (one minus the fraction of falsely included edges). For comparison 
purposes, we used the following single-task methods: covariance selection [Banerjee et al. 
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0.2 0.4 0.6 O.i 

1 - Specificity 
Density 0.1 



0.2 0.4 0.6 0.8 
1 - Specificity 
Density 0.3 



0.2 0.4 0.6 0.8 
1 - Specificity 
Density 0.5 




Regularization parameter 
Density 0.1 



Regularization parameter 
Density 0.3 



Regularization parameter 
Density 0.5 



Figure 1: ROC curves (top) and cross- validated KuUback-Leibler divergence (bottom) between the recovered 
models and the ground truth for low (left), moderate (center) and high (right) edge density. Both our li^oo 
(MI) and ^1,2 (M2) multi-task methods recover the ground truth edges remarkably better than the £1^2 multi- 
task upper bound method (U2), Meinshausen-Biihlmann with AND-rule (MA), OR-rule (MO), graphical 
lasso (GL), covariance selection (CS) and Tikhonov regularization (TR). The KuUback-Leibler divergence 
of our £1^2 method is always lower than the £-1^2 upper bound method for all the regularization values. 



2006 , graphical lasso Friedman et al. , 2007 , Meinshausen-Biihlmann approximation [Mein- 
shausen and Biihlmann 2006 and Tikhonov regularization. We also compared our method 



to the ii 2 multi-task upper bound method of Varoquaux et al. 



2010 



Figure [T] shows the ROC curves and KuUback-Leibler divergence between the recovered 
models and the ground truth. Note that both our £1 oQ and £1 2 multi-task methods recover 
the ground truth edges remarkably better (higher ROC) than the comparison methods, in- 
cluding the £1^2 multi-task upper bound method of Varoquaux et al. 20101. Our £1^2 method 



always produces better probability distributions (lower Kullback-Leibler divergence) than 
the £1^2 upper bound method for all the regularization values. We did not observe a sig- 
nificant difference in Kullback-Leibler divergence between penalizing the weights in the 
diagonals versus not penalizing the diagonals for most regularization levels p < 0.19. For 
p > 0.19, diagonal penalization leads to a slightly worse Kullback-Leibler divergence. Fur- 
thermore, the ROC curves for our methods with and without diagonal penalization were 
the same. Therefore, we chose to report only the results without diagonal penalization. 

For experimental validation on a real- world dataset, we first use a fMRI dataset that 
captures brain function of cocaine addicted and control subjects under conditions of mon- 
etary reward. The dataset collected by Goldstein et al. p007j contains 16 cocaine addicted 
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Regularization parameter 

(a) 



Regularization parameter 

(b) 



Figure 2: Cross-validated log-likelihood of structures learnt for each of the six sessions on cocaine addicted 
subjects (a) and control subjects (b). Both our fi,oo (MI) and £1,2 (M2) multi-task methods have higher 
log-likelihood than the £1,2 multi-task upper bound method (U2), Meinshausen-Biihlmann with AND-rule 
(MA), OR-rule (MO), graphical lasso (GL), covariance selection (CS) and Tikhonov regularization (TR). 
Our £1^2 method is always better than the £1^2 multi-task upper bound method for all the regularization 
values. 



subjects and 12 control subjects. Six sessions were acquired for each subject. Each ses- 
sion contains 87 scans taken every 3.5 seconds. Registration of the dataset to the same 
spatial reference template (Talairach space) and spatial smoothing was performed in SPM2 
(http : //www. f il . ion.ucl . ac.uk/ spm/). We extracted voxels from the gray matter only, 
and grouped them into 157 regions by using standard labels (Please, see Appendix [A|), given 
by the Talairach Daemon ( [http : //www . t alairach . org/ ) . These regions span the entire 
brain (cerebellum, cerebrum and brainstem). In order to capture laterality effects, we have 
regions for the left and right side of the brain. 

First, we test the idea of learning one Gaussian graphical model for each of the six 
sessions, i.e. each session is a task. We performed five-fold cross-validation on the subjects, 
and report the log-likelihood on the testing set (scaled for visualization purposes) . In Figure 
[2| we can observe that the log-likelihood of both our £1^00 and £1,2 multi-task methods is 
better than the comparison methods. Moreover, our ii^2 method is always better than the 
£1^2 multi-task upper bound method of Varoquaux et al. 2010 for all the regularization 
values. We did not observe a significant difference in log-likelihood between penalizing the 
weights in the diagonals versus not penalizing the diagonals for most regularization levels 
p < 0.19. For p > 0.19, diagonal penalization leads to a slightly worse log-likelihood. 
Therefore, we chose to report only the results without diagonal penalization. 

Second, we test the idea of learning one Gaussian graphical model for each subject, i.e. 
each subject is a task. It is well known that fMRI datasets have more variability across 
subjects than across sessions of the same subject. Therefore, our cross-validation setting 
works as follows: we use one session as training set, and the remaining five sessions as testing 
set. We repeat this procedure for all the six sessions and report the log-likelihood (scaled 
for visualization purposes) . In Figure [sj we can observe that the log-likelihood of both our 
il 00 and £1 2 multi-task methods is better than the comparison methods. Moreover, our 



ti^2 method is always better than the £1^2 multi-task upper bound method of Varoquaux 
et al. [2010j for all the regularization values. Finally, both our ^1^00 and £1^2 methods are 
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Figure 3: Cross-validated log-likelihood of structures learnt for each subject on cocaine addicted subjects (a) 
and control subjects (b). Both our li^oo (MI) and £1^2 (M2) multi-task methods have higher log-likelihood 
than the £1^2 multi-task upper bound method (U2), Meinshausen-Biihlmann with AND-rule (MA), OR-rule 
(MO), graphical lasso (GL), covariance selection (CS) and Tikhonov regularization (TR). Our £1^2 method 
is always better than the £1^2 multi-task upper bound method for all the regularization values. For low 
regularization levels, our methods are more stable than the comparison methods. 



Method 


SI 


S2 


S3 


S4 


S5 


S6 


S7 


S8 


S9 


SIO 


Sll 


S12 


S13 


S14 


S15 


S16 


MA 


27.4 


14.7 


9.1 


12.0 


18.9 


10.4 


9.0 


19.6 


9.6 


8.1 


8.5 


17.2 


23.2 


19.2 


19.5 


10.3 


MO 


25.6 


17.0 


10.4 


13.7 


19.4 


10.4 


10.7 


20.4 


13.5 


11.2 


10.1 


18.5 


22.5 


17.6 


21.1 


13.9 


GL 


2.0 


2.7 


1.9 


1.5 


0.7* 


1.8 


2.7 


2.2 


4.3 


2.9 


1.8 


1.8 


1.8 


2.2 


4.7 


2.9 


CS 


2.0 


2.6 


1.9 


1.5 


0.7* 


1.8 


2.7 


2.1 


4.3 


2.9 


1.8 


1.8 


1.8 


2.2 


4.7 


2.9 


TR 


15.4 


5.1 


3.6 


6.3 


10.3 


6.7 


3.5 


12.0 


3.2 


2.2 


3.7 


8.8 


8.8 


11.4 


8.0 


5.0 


U2 


5.5 


2.3 


1.7 


2.0 


4.0 


1.9 


1.1* 


4.3 


1.3 


0.7* 


1.3 


3.9 


3.3 


3.9 


2.9 


1.2* 


M2 


0.8* 


-0.3* 


0.1* 


0.3* 


0.7* 


0.5* 


-0.1* 


1.0* 


-0.1* 


-0.2* 


-0.0* 


0.8* 


0.5* 


1.1* 


0.3* 


-0.3* 




Table 2: Z-statistic for the difference of log-likelihoods between our £i,oa technique and each other method, 
for 16 cocaine addicted subjects. Expect for few cases (marked with an asterisk), our method is statistically 
significantly better (90%, Z > 1.28) than the £1^2 upper bound method (U2), Meinshausen-Biihlmann with 
AND-rule (MA), OR-rule (MO), graphical lasso (GL), covariance selection (CS) and Tikhonov regularization 
(TR). The £1^00 and £1^2 (M2) methods are not statistically significantly different. 



more stable for low regularization levels than the other methods in our evaluation, which 
perform very poorly. 

In order to measure the statistical significance of our previously reported log-likelihoods, 
we further compared the best parameter setting for each of the techniques. In Tables [2] and 
[3| we report the two sample Z-statistic for the difference of both our £1^00 and £1^2 techniques 
minus each competing method. Except for few subjects, the cross- validated log-likelihood of 
both our .^i^oo and £1^2 methods is statistically significantly higher (90%, Z > 1.28) than the 



comparison methods, including the £1^2 multi-task upper bound method of Varoquaux et al. 



[2010 . Our £1^00 and £1^2 multi-task methods are not statistically significantly different. 



We show a subgraph of learnt structures for three randomly selected cocaine addicted 
subjects in Figure |4j We can observe that the sparseness pattern of the structures produced 
by our multi-task method is consistent across subjects. 

Next, we present experimental results on a considerably larger real- world dataset. The 
1000 functional connectomes dataset contains resting-state fMRI of over 1128 subjects col- 
lected on several sites around the world. The dataset is publicly available at http : / / www .1 
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3.7 
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3.4 


2.5 


3.8 


8.1 


8.5 


10.3 


7.9 


5.4 


U2 


4.8 


2.6 


1.7 


1.8 


3.3 


1.5 


1.2* 


3.4 


1.5 


0.9* 


1.4 


3.2 


2.8 


2.9 


2.6 


1.5 


MI 


-0.8* 


0.3* 


-0.1* 


-0.3* 


-0.7* 


-0.5* 


0.1* 


-1.0* 


0.1* 


0.2* 


0.0* 


-0.8* 


-0.5* 


-1.1* 


-0.3* 


0.3* 



Table 3: Z-statistic for the difference of log-likelihoods between our li^2 technique and each other method, 
for 16 cocaine addicted subjects. Expect for few cases (marked with an asterisk), our method is statistically 
significantly better (90%, Z > 1.28) than the £1^2 upper bound method (U2), Meinshausen-Biihlmann with 
AND-rule (MA), OR-rule (MO), graphical lasso (GL), covariance selection (CS) and Tikhonov regularization 
(TR). The ^i,oo (MI) and £1^2 methods are not statistically significantly different. 




(a) (b) (o) (d) 

Figure 4: Subgraph of ten randomly selected brain regions from learnt structures for three randomly selected 
cocaine addicted subjects, for (a) our £i,aa multi-task method, (b) our £1^2 multi-task method, (c) the £1,2 
upper bound method and (d) graphical lasso. Regularization parameter p = 0.01. Positive interactions are 
shown in blue, negative interactions are shown in red. Notice that sparseness of our structures (a,b) are 
consistent across subjects, while the remaining methods (c,d) fail to obtain a consistent sparseness pattern. 



nitrc.org/projects/fcon_1000/ Resting-state fMRI is a procedure that captures brain 
function of an individual that is not focused on the outside world, while his brain is at wake- 
ful rest. Registration of the dataset to the same spatial reference template (Talairach space) 
and spatial smoothing was performed in SPM2 (http://www.fil.ion.ucl.ac.uk/spm/). 
We extracted voxels from the gray matter only, and grouped them into 157 regions by 
using standard labels (Please, see Appendix [A]) , given by the Talairach Daemon (http : ' 
,/ / www . talairach . org/ J . These regions span the entire brain (cerebellum, cerebrum and 
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Table 4: Number of subjects per collection site and number of scans per subject in the 1000 functional 
connectomes dataset 



brainstem). In order to capture laterality effects, we have regions for the left and right side 
of the brain. Table [4] shows the number of subjects per collection site as well as the number 
of scans per subject. 

We learn one Gaussian graphical model for each of the 41 collection sites, i.e. each 
site is a task. For each site, we used one third of the subjects for training, one third for 
validation and the remaining third for testing. We performed six repetitions by making 
each third of the subjects take turns as training, validation and testing sets. We report the 
negative log-likelihood on the testing set in Figjs] (we subtracted the entropy measured on 
the testing set and then scaled the results for visualization purposes). We can observe that 
the log-likelihood of both our ii^oo and £1^2 multi-task methods is better than the comparison 
methods. Moreover, our £1^2 method is better than the £1^2 multi-task upper bound method 



of Varoquaux et al. 2010 . Our results also suggest that diagonal penalization does not 
produce better generalization performance. 

We show a subgraph of learnt structures for three randomly selected collection sites in 
Figure |6j We can observe that the sparseness pattern of the structures produced by our 
multi-task method is consistent across collection sites. 



10 Conclusions and Future Work 

In this paper, we generalized the learning of sparse Gaussian graphical models to the multi- 
task setting by replacing the -^i-norm regularization with an £i^p-norm. We presented a 
block coordinate descent method which is provably convergent and yields sparse and positive 
definite estimates. We showed the connection between our ^1^00 multi-task structure learning 
problem and the continuous quadratic knapsack problem, as well as the connection between 
our ii^2 multi-task structure learning problem and the quadratic trust-region problem. In 
synthetic experiments, we showed that our method outperforms others in recovering the 
topology of the ground truth model. The cross-validated log-likelihood of our method 
is higher than competing methods in two real-world brain fMRI datasets. For the £1^2 
problem, our block coordinate descent method leads to better ground-truth recovery and 



20 




TR MO MA CS GL U2 Ml Dl M2 D2 



Figure 5: Test negative log-likelihood of structures learnt for the 1000 functional connectomes dataset. 
Differences between our multi-task methods and the rest are statistically significant (99%, Z > 2.33). Both 
our fi,oo (MI) and ^1,2 (M2) multi-task methods (without diagonal penalization) as well as our £1,00 (DI) and 
ii^2 (E)2) multi-task methods (with diagonal penalization), have better log-likelihood than the ^1,2 multi-task 
upper bound method (U2), Meinshausen-Biihlmann with AND-rule (MA), OR-rule (MO), graphical lasso 
(GL), covariance selection (CS) and Tikhonov regularization (TR). Our £1,2 method is better than the ^1,2 
multi-task upper bound method. Our results also suggest that diagonal penalization does not produce better 
generalization performance. 




(a) (b) (c) (d) 

Figure 6: Subgraph of ten randomly selected brain regions from learnt structures for three randomly selected 
collection sites, for (a) our l\,oo multi-task method, (b) our ^1,2 multi-task method, (c) the £1,2 upper bound 
method and (d) covariance selection. Regularization parameter p = 0.0002. Positive interactions are shown 
in blue, negative interactions are shown in red. Notice that sparseness of our structures (a,b) are consistent 
across collection sites, while the remaining methods (c,d) fail to obtain a consistent sparseness pattern. 



generalization when compared to the upper bound method of Varoquaux et al. [2010 . 



We experimentally found that diagonal penalization does not lead to better generalization 
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performance, when compared to not penalizing the diagonals. Our methods with and 
without diagonal penalization recover the ground truth edges similarly well. Therefore, we 
believe the negative impact of diagonal penalization is not on structure recovery but on 
parameter learning. 

There are several ways of extending this research. In practice, our technique converges 
in a small number of iterations, but a more precise analysis of the rate of convergence 
needs to be performed. Model selection consistency when the number of samples grows to 
infinity needs to be proved. Finally, we hope the connection to the quadratic knapsack and 
trust-region problems will be useful for other multi-task problems, e.g. regression. 
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A List of Brain Regions 



We present the list of the 157 regions used in Section [9} Parenthesis, e.g. "(Left, Right) 
Amygdala", indicate that we used two regions: Left Amygdala and Right Amygdala. 

• Cerebellum: Cerebellar Lingual 

• Cerebellum: (Culmen, Declive, Pyramis, Tuber, Uvula) of Vermis 

• Cerebellum: (Left, Right) (Cerebellar Tonsil, Culmen, Declive, Dentate, Fastigium, 
Inferior Semi-Lunar Lobule, Nodule, Pyramis, Tuber, Uvula) 

• Cerebrum: Hypothalamus 

• Cerebrum: (Left, Right) (Amygdala, Claustrum, Hippocampus, Pulvinar, Putamen) 

• Cerebrum: (Left, Right) (Anterior, Lateral Dorsal, Lateral Posterior, Medial Dorsal, 
Midline, Ventral Anterior, Ventral Lateral, Ventral Posterior Lateral, Ventral Poste- 
rior Medial) Nucleus 

• Cerebrum: (Left, Right) Brodmann area (1,2,. . . ,47) 

• Cerebrum: (Left, Right) Caudate (Body, Head, Tail) 

• Cerebrum: (Left, Right) (Lateral, Medial) Globus Pallidus 

• Brainstem: (Left, Right) (Mammillary Body, Red Nucleus, Substantia Nigra, Sub- 
thalamic Nucleus) 



25 



